
feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations#4490

Open
hd9568 wants to merge 1 commit into InternLM:main from hd9568:feature/blackwell-moe-opt

Conversation


@hd9568 hd9568 commented Apr 3, 2026

Motivation

TurboMind’s existing MoE path relies on CUTLASS-style fused/grouped kernels that target SM90. On NVIDIA Blackwell (SM100, e.g. B200), that path is not a drop-in replacement: building SM90 kernels for SM100 toolchains is problematic, and MoE inference needs a stable, vendor-supported grouped GEMM.

This PR adds a cuBLAS Grouped Batched GEMM path (cublasGemmGroupedBatchedEx, CUDA 12.5+) for BF16/FP16 MoE FFN on SM100, so models such as Qwen3.5 MoE can run on Blackwell. It also reduces per-launch overhead in the grouped cuBLAS launcher (fewer synchronizations, no per-call device malloc for pointer arrays, reuse of pre-allocated workspace) and applies a safe fallback where cuMemcpyBatchAsync is known to misbehave on SM100.
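For readers unfamiliar with grouped GEMM, the following CPU sketch (illustrative only, not TurboMind or cuBLAS code) shows the semantics the grouped path provides for MoE: each expert `g` has its own row count `m[g]` ("ragged M") while `n` and `k` are shared across the group.

```cpp
#include <cstddef>
#include <vector>

// Naive CPU reference for grouped GEMM with ragged M:
// for every group g, C[g] = A[g] * B[g], where A[g] is m[g] x k,
// B[g] is k x n, and C[g] is m[g] x n (all row-major).
// Names here are illustrative, not TurboMind APIs.
void grouped_gemm_ref(const std::vector<std::vector<float>>& A,
                      const std::vector<std::vector<float>>& B,
                      std::vector<std::vector<float>>&       C,
                      const std::vector<int>& m, int n, int k)
{
    for (size_t g = 0; g < m.size(); ++g) {
        for (int i = 0; i < m[g]; ++i) {
            for (int j = 0; j < n; ++j) {
                float acc = 0.f;
                for (int p = 0; p < k; ++p) {
                    acc += A[g][i * k + p] * B[g][p * n + j];
                }
                C[g][i * n + j] = acc;
            }
        }
    }
}
```

cublasGemmGroupedBatchedEx performs the same per-group computation in one call, taking per-group pointer tables and per-group problem sizes.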

Goal: Correct and more efficient MoE inference on Blackwell without breaking existing architectures (H100 and below keep their current kernel selection).


Modification

  • Build / arch

    • Add CUDA arch 100a-real (B200) when using CUDA ≥ 12.8.
    • Split SM90 CUTLASS GEMM sources into a separate static target (gemm2_sm90) compiled only for 90/90a, so SM100-only builds remain valid while H100 compatibility is preserved when SM90 objects are still linked.
    • Define ENABLE_CUBLAS_GROUPED when targeting SM100 and CUDA ≥ 12.5; register CublasGroupedKernel in the GEMM registry for arch ≥ 1000.
  • cublas.cu

    • Implement CublasGroupedKernel wrapping cublasGemmGroupedBatchedEx with the documented row-major ↔ col-major mapping for MoE (ragged M per expert).
    • Optimize Launch: merged D2H where needed, reuse workspace.tensormaps for device-side A/B/C pointer tables (no cudaMallocAsync per call), stream ordering instead of extra barriers where safe, cublasSetWorkspace from workspace.partials, single-pass construction of active groups.
  • Weight / layout

    • convert_v3.cu: On SM100, skip tiled weight conversion for grouped BF16/FP16 so weights stay in the layout expected by grouped cuBLAS.
    • LlamaDenseWeight.cc: On SM100 grouped path, disable fused GatedSiLU so activation runs outside the plain GEMM epilogue.
    • LlamaLinear.cu: Extend MoE token gather to BF16/half when unfused grouped path is required (aligned with FP8 gather + scale dispatch behavior).
  • Stability

    • copy.cc: On SM100+, avoid cuMemcpyBatchAsync (crash workaround); use sequential cudaMemcpyAsync via existing core::Copy, with cached compute-capability check to avoid querying the device every Run().
  • Misc

    • moe_utils_v2.h: Add #pragma once.
    • arch.h: Add Sm100 and compatibility wiring.

BC-breaking (Optional)

No intentional API or config break for Python users or existing TurboMind deployments.

  • Build: Projects that only built for SM100 without SM90 may now pull in gemm2_sm90 automatically when the CMake logic enables it for H100 compatibility; artifact size may increase slightly for fat binaries.
  • Runtime: Behavior change is scoped to SM100+ and grouped BF16/FP16 MoE: slightly different fusion boundary (GatedSiLU unfused) vs fused CUTLASS path on older GPUs. Numerics should remain in family with the existing “unfused activation” reference path; any strict bit-identical requirement across backends is not guaranteed.

Downstream forks that patch MoE or GEMM registration should rebase carefully; others need no code changes.


Use cases (Optional)

  • Deploy Qwen3.5 (or similar) MoE models on Blackwell (B200) with TurboMind / lmdeploy when built with CUDA 12.5+ and SM100 in CMAKE_CUDA_ARCHITECTURES.
  • Same codebase continues to serve H100 / A100 via existing CUTLASS + cuBLAS paths when SM90 kernels are enabled in the build.

(Optional doc follow-up: mention Blackwell + grouped cuBLAS MoE in TurboMind build notes or supported-hardware table if the project maintains one.)


Checklist

  1. Pre-commit / lint: Please run pre-commit run --all-files (or project CI) before merge; fix any reported issues.
  2. Tests: MoE + SM100 path is hardware-specific; full coverage in CI may not exist. If possible, add or extend a small regression test on a machine with B200, or document manual test steps (model + command). Otherwise state manual verification in the PR discussion.
  3. Downstream versions: N/A beyond CUDA 12.5+ and a driver that supports cublasGemmGroupedBatchedEx on SM100.
  4. Docs: Optional update to build / hardware docs for Blackwell MoE; not strictly required if maintainers prefer a follow-up PR.


Copilot AI left a comment


Pull request overview

This PR adds an SM100 (Blackwell) MoE GEMM backend for TurboMind by integrating a cuBLAS grouped batched GEMM path and adjusting build/runtime logic so SM100 builds can coexist with SM90 CUTLASS kernels while working around SM100-specific memcpy instability.

Changes:

  • Add cublasGemmGroupedBatchedEx-based grouped GEMM kernel for BF16/FP16 MoE on SM100, with workspace reuse and reduced per-launch overhead.
  • Update build/arch plumbing (SM100 arch, split SM90 kernels into a separate target, conditional registration/compilation defines).
  • Adjust MoE weight/layout and copy behavior for SM100 (skip tiled conversion for grouped BF16/FP16, unfuse GatedSiLU, avoid cuMemcpyBatchAsync on SM100).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • src/turbomind/models/llama/LlamaLinear.cu: Extends MoE gather to cover BF16/FP16 for the unfused SM100 grouped path.
  • src/turbomind/models/llama/LlamaDenseWeight.cc: Disables fused GatedSiLU under SM100 grouped BF16/FP16 constraints.
  • src/turbomind/kernels/gemm/registry.h: Adds registry hook for the SM100 grouped cuBLAS kernel.
  • src/turbomind/kernels/gemm/registry.cu: Conditionally registers SM90 kernels and the SM100 grouped cuBLAS kernel.
  • src/turbomind/kernels/gemm/moe_utils_v2.h: Adds #pragma once.
  • src/turbomind/kernels/gemm/cublas.cu: Implements CublasGroupedKernel using cublasGemmGroupedBatchedEx.
  • src/turbomind/kernels/gemm/convert_v3.cu: Skips tiled conversion for SM100 grouped BF16/FP16 to match cuBLAS expectations.
  • src/turbomind/kernels/gemm/CMakeLists.txt: Splits SM90 kernels into gemm2_sm90; enables SM100 grouped cuBLAS via compile defs.
  • src/turbomind/kernels/gemm/arch.h: Adds Sm100 and updates compatibility ranges.
  • src/turbomind/core/copy.cc: Adds SM100 workaround to avoid cuMemcpyBatchAsync.
  • CMakeLists.txt: Adds 100a-real CUDA arch for B200 when CUDA ≥ 12.8.


// MoE gather: FP8 always. BF16/half when need_unfused_moe_gather (no idxs gather; e.g. SM100 grouped cuBLAS).
const bool need_unfused_moe_gather =
    (int)A.shape(0) != m && dense.epilogue != Epilogue::kGatedSilu;
if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {

Copilot AI Apr 3, 2026


need_unfused_moe_gather can make this branch call invokeMoeDispatch() for non-FP8 dtypes (e.g. float32) when indices is set and dense.epilogue != kGatedSilu. invokeMoeDispatch() only supports 8-bit and 16-bit element sizes and will TM_CHECK(0) on other types, so this can become a runtime crash for float models/tests. Consider tightening the condition to only enable the unfused gather for kHalf/kBfloat16 (or byte_size(A.dtype()) == 2) in addition to FP8.

Suggested change
-if (indices && (A.dtype() == kFloat8_e4m3 || need_unfused_moe_gather)) {
+const bool supports_unfused_moe_gather =
+    A.dtype() == kFloat8_e4m3 || A.dtype() == kHalf || A.dtype() == kBfloat16;
+if (indices && (A.dtype() == kFloat8_e4m3 || (need_unfused_moe_gather && supports_unfused_moe_gather))) {

// weight descriptor as Adesc; weight has no valid offsets -> Adesc.offsets=(nil) and Launch fails.
if (desc.group_axis != 0) {
return false;
}

Copilot AI Apr 3, 2026


CublasGroupedKernel::is_feasible() does not verify desc.order_a/order_b/order_c against the layout assumptions in Launch (row-major A & D, col-major interpreted weight). Without these checks, this kernel could be selected for grouped GEMMs with different operand orders and then compute incorrect results. Add explicit order checks (and/or reuse Kernel::is_feasible logic for orders) while still allowing both kHalf and kBfloat16.

Suggested change
 }
 }
+// Launch assumes row-major A and D/C, with B interpreted as column-major weight.
+if (desc.order_a != Order::kRowMajor || desc.order_b != Order::kColMajor
+    || desc.order_c != Order::kRowMajor) {
+    return false;
+}

Comment on lines +281 to +289
if (weight_is_strided_ptrs) {
    const uintptr_t kBadB = 0x320936400ULL;
    if (B == nullptr || reinterpret_cast<uintptr_t>(B) == kBadB) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B null or bad (B=%p)\n", (void*)B);
        return 1;
    }
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, B) != cudaSuccess || attr.type != cudaMemoryTypeDevice) {
        fprintf(stderr, "[TM][GEMM] CublasGrouped: B not device ptr (attr.type=%d)\n", (int)attr.type);

Copilot AI Apr 3, 2026


The hard-coded “bad pointer” sentinel value 0x320936400ULL is unexplained and appears to be debugging residue. It can cause false positives and is difficult to maintain/justify as a correctness check. Prefer relying on cudaPointerGetAttributes (and/or other validated invariants) or gate this sentinel check behind a debug-only macro with a clear explanation of its origin.

// Use pre-allocated workspace for device pointer arrays (no cudaMalloc/Free per call)
const size_t one_array = active_count * sizeof(void*);
const size_t total_bytes = 3 * one_array;
TM_CHECK_LE(total_bytes, workspace.tensormaps_size);

Copilot AI Apr 3, 2026


TM_CHECK_LE(total_bytes, workspace.tensormaps_size) will abort the process if the workspace is too small. Since this is a runtime property (depends on active_count), consider returning a nonzero error code with a clear message instead, so callers can fail gracefully (similar to the other argument validation in this function).

Suggested change
-TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
+if (total_bytes > workspace.tensormaps_size) {
+    fprintf(stderr,
+            "[TM][GEMM] CublasGrouped: workspace.tensormaps_size too small "
+            "(required=%zu, available=%zu, active_count=%d)\n",
+            total_bytes,
+            workspace.tensormaps_size,
+            active_count);
+    return 1;
+}

Comment on lines +84 to +90
// cuMemcpyBatchAsync is known to crash on sm_100 (Blackwell); use sequential path.
int device = 0;
(void)cudaGetDevice(&device);
int major = 0;
(void)cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
if (major >= 10) {
    for (unsigned i = 0; i < src_.size(); ++i) {

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SM100 workaround queries cudaGetDevice + cudaDeviceGetAttribute(cudaDevAttrComputeCapabilityMajor, ...) on every BatchCopy::Run() call. BatchCopy can be used in tight loops, so this adds avoidable overhead. Consider caching the compute capability once (e.g. static local, or a member initialized in the constructor, with a per-device cache if needed).
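A minimal sketch of the caching being suggested, using a function-local static whose initializer runs exactly once per process (thread-safe since C++11). `query_cc_major_expensive` is a hypothetical stand-in for the real `cudaGetDevice` + `cudaDeviceGetAttribute` pair; the counter only exists to make the one-time behavior observable.

```cpp
// Counter lets us observe how many times the "device query" actually runs.
static int g_query_calls = 0;

// Hypothetical stand-in for cudaGetDevice + cudaDeviceGetAttribute(
// cudaDevAttrComputeCapabilityMajor, ...), which the workaround calls today.
static int query_cc_major_expensive()
{
    ++g_query_calls;
    return 10;  // pretend we are on SM100
}

// Function-local static: the initializer runs exactly once, so repeated
// BatchCopy::Run() calls pay for the device query only on first use.
static int cached_cc_major()
{
    static const int major = query_cc_major_expensive();
    return major;
}

bool use_sequential_copy_workaround()
{
    return cached_cc_major() >= 10;  // SM100+: avoid cuMemcpyBatchAsync
}
```

A per-device array would be needed if BatchCopy can run on multiple devices in one process; the single static shown here assumes one device per process.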

Comment on lines +69 to +73
set(_has_sm100 FALSE)
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
set(_has_sm100 TRUE)

Copilot AI Apr 3, 2026


_has_sm100 is set but never used. If it’s not needed, remove it; if it is meant to drive later logic, wire it up so the intent is clear (unused variables in CMake can hide configuration bugs).

Suggested change
-set(_has_sm100 FALSE)
 set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
 list(FILTER _archs_100 INCLUDE REGEX "^100")
 if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
-set(_has_sm100 TRUE)

… inference on Blackwell GPUs with memory copy optimizations

Co-authored-by: 2016202110015 <2016202110015@whu.edu.cn>
@hd9568 hd9568 force-pushed the feature/blackwell-moe-opt branch from 69d9d5e to 0d53c58 Compare April 3, 2026 09:20

hd9568 commented Apr 3, 2026

@cassiewilliam
